GOAL: To be able to use the 6 data frame manipulation in dplyr.
setwd('~/workshops/data-workshops/notebooks/dplyr-tidyr/')
#gapminder <- read.csv("https://goo.gl/BtBnPg", header = T)
gapminder <- read.csv("data/gapminder.csv", header = T)
Manipulation of dataframes means many things to many researchers,
We can do these operations using the normal base R operations:
mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
## [1] 2193.755
mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
## [1] 7136.11
mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
## [1] 7902.15
package provides a number of very useful functions for manipulating dataframes in a way that will * reduces repetition * introduces functions to help * dplyr grammar easier to read
Here we’re going to cover 6 of the most commonly used functions as well as using pipes (%>%) to combine them.
Install this package:
#install.packages('dplyr')
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
head(gapminder)
If we want to select just a few variables in our data frame.
year_country_gdp <- select(gapminder, year, country, gdpPercap)
Show fig 1 in lesson.
* If we open up year_country_gdp we’ll see that it only contains the
year, country and gdpPercap. Above we used ‘normal’ grammar, but the strengths of dplyr lie in combining several functions using pipes. Since the pipes grammar is unlike anything we’ve seen in R before, let’s repeat what we’ve done above using pipes.
Using pipes (%>%), forward pipe left to right, obviates need to include dataframe
year_country_gdp <- gapminder %>% select(year,country,gdpPercap)
#base-r way
year_country_gdp_br <- gapminder[, c("year", "country", "gdpPercap")]
If we now wanted to move forward with the above, but only with European countries, we can combine select and filter
year_country_gdp_euro <- gapminder %>%
filter(continent == "Europe") %>%
select(year, country, gdpPercap)
http://swcarpentry.github.io/r-novice-gapminder/13-dplyr#challenge-1
str(gapminder)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
str(gapminder %>% group_by(continent))
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
## - attr(*, "vars")=List of 1
## ..$ : symbol continent
## - attr(*, "drop")= logi TRUE
## - attr(*, "indices")=List of 5
## ..$ : int 24 25 26 27 28 29 30 31 32 33 ...
## ..$ : int 48 49 50 51 52 53 54 55 56 57 ...
## ..$ : int 0 1 2 3 4 5 6 7 8 9 ...
## ..$ : int 12 13 14 15 16 17 18 19 20 21 ...
## ..$ : int 60 61 62 63 64 65 66 67 68 69 ...
## - attr(*, "group_sizes")= int 624 300 396 360 24
## - attr(*, "biggest_group_size")= int 624
## - attr(*, "labels")='data.frame': 5 obs. of 1 variable:
## ..$ continent: Factor w/ 5 levels "Africa","Americas",..: 1 2 3 4 5
## ..- attr(*, "vars")=List of 1
## .. ..$ : symbol continent
## ..- attr(*, "drop")= logi TRUE
gapminder-group
gdp_bycontinents <- gapminder %>%
group_by(continent) %>%
summarize(mean_gdpPercap=mean(gdpPercap))
gap-summarize
http://swcarpentry.github.io/r-novice-gapminder/13-dplyr#challenge-2
gdp_bycontinents_byyear <- gapminder %>%
group_by(continent,year) %>%
summarize(mean_gdpPercap=mean(gdpPercap))
gdp_pop_bycontinents_byyear <- gapminder %>%
group_by(continent,year) %>%
summarize(mean_gdpPercap=mean(gdpPercap),
sd_gdpPercap=sd(gdpPercap),
mean_pop=mean(pop),
sd_pop=sd(pop))
gdp_pop_bycontinents_byyear <- gapminder %>%
mutate(gdp_billion=gdpPercap*pop/10^9) %>%
group_by(continent,year) %>%
summarize(mean_gdpPercap=mean(gdpPercap),
sd_gdpPercap=sd(gdpPercap),
mean_pop=mean(pop),
sd_pop=sd(pop),
mean_gdp_billion=mean(gdp_billion),
sd_gdp_billion=sd(gdp_billion))
http://swcarpentry.github.io/r-novice-gapminder/13-dplyr#advanced-challenge
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Researchers often want to manipulate their data from the ‘wide’ to the ‘long’ format, or vice-versa. The ‘long’ format is where:
each row is an observation
In the ‘long’ format, you usually have 1 column for the observed variable and the other columns are ID variables.
However, many of R’s functions have been designed assuming you have ‘long’ format data. This tutorial will help you efficiently transform your data regardless of original format.
widevlong
#install.packages("tidyr")
#install.packages("dplyr")
library("tidyr")
library("dplyr")
str(gapminder)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
http://swcarpentry.github.io/r-novice-gapminder/14-tidyr#challenge-1
There are few operations that would need us to stretch out this dataframe any longer (i.e. 4 ID variables and 1 Observation variable).
Note: Some plotting functions in R actually work better in the wide format data.
Until now, we’ve been using the nicely formatted original gapminder dataset, but ‘real’ data (i.e. our own research data) will never be so well organized. Here let’s start with the wide format version of the gapminder dataset.
We’ll load the data file and look at it. Note: we don’t want our continent and country columns to be factors, so we use the stringsAsFactors argument for read.csv() to disable that.
download.file('https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_wide.csv', 'data/gapminder_wide.csv')
gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE)
str(gap_wide)
## 'data.frame': 142 obs. of 38 variables:
## $ continent : chr "Africa" "Africa" "Africa" "Africa" ...
## $ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
## $ gdpPercap_1952: num 2449 3521 1063 851 543 ...
## $ gdpPercap_1957: num 3014 3828 960 918 617 ...
## $ gdpPercap_1962: num 2551 4269 949 984 723 ...
## $ gdpPercap_1967: num 3247 5523 1036 1215 795 ...
## $ gdpPercap_1972: num 4183 5473 1086 2264 855 ...
## $ gdpPercap_1977: num 4910 3009 1029 3215 743 ...
## $ gdpPercap_1982: num 5745 2757 1278 4551 807 ...
## $ gdpPercap_1987: num 5681 2430 1226 6206 912 ...
## $ gdpPercap_1992: num 5023 2628 1191 7954 932 ...
## $ gdpPercap_1997: num 4797 2277 1233 8647 946 ...
## $ gdpPercap_2002: num 5288 2773 1373 11004 1038 ...
## $ gdpPercap_2007: num 6223 4797 1441 12570 1217 ...
## $ lifeExp_1952 : num 43.1 30 38.2 47.6 32 ...
## $ lifeExp_1957 : num 45.7 32 40.4 49.6 34.9 ...
## $ lifeExp_1962 : num 48.3 34 42.6 51.5 37.8 ...
## $ lifeExp_1967 : num 51.4 36 44.9 53.3 40.7 ...
## $ lifeExp_1972 : num 54.5 37.9 47 56 43.6 ...
## $ lifeExp_1977 : num 58 39.5 49.2 59.3 46.1 ...
## $ lifeExp_1982 : num 61.4 39.9 50.9 61.5 48.1 ...
## $ lifeExp_1987 : num 65.8 39.9 52.3 63.6 49.6 ...
## $ lifeExp_1992 : num 67.7 40.6 53.9 62.7 50.3 ...
## $ lifeExp_1997 : num 69.2 41 54.8 52.6 50.3 ...
## $ lifeExp_2002 : num 71 41 54.4 46.6 50.6 ...
## $ lifeExp_2007 : num 72.3 42.7 56.7 50.7 52.3 ...
## $ pop_1952 : num 9279525 4232095 1738315 442308 4469979 ...
## $ pop_1957 : num 10270856 4561361 1925173 474639 4713416 ...
## $ pop_1962 : num 11000948 4826015 2151895 512764 4919632 ...
## $ pop_1967 : num 12760499 5247469 2427334 553541 5127935 ...
## $ pop_1972 : num 14760787 5894858 2761407 619351 5433886 ...
## $ pop_1977 : num 17152804 6162675 3168267 781472 5889574 ...
## $ pop_1982 : num 20033753 7016384 3641603 970347 6634596 ...
## $ pop_1987 : num 23254956 7874230 4243788 1151184 7586551 ...
## $ pop_1992 : num 26298373 8735988 4981671 1342614 8878303 ...
## $ pop_1997 : num 29072015 9875024 6066080 1536536 10352843 ...
## $ pop_2002 : int 31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
## $ pop_2007 : int 33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...
* The first step towards getting our nice intermediate data format is to first convert from the wide to the long format. * The tidyr function
gather() will ‘gather’ your observation variables into a single variable.
gap_long <- gap_wide %>%
gather(obstype_year, obs_values, starts_with('pop'),
starts_with('lifeExp'), starts_with('gdpPercap'))
str(gap_long)
## 'data.frame': 5112 obs. of 4 variables:
## $ continent : chr "Africa" "Africa" "Africa" "Africa" ...
## $ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
## $ obstype_year: chr "pop_1952" "pop_1952" "pop_1952" "pop_1952" ...
## $ obs_values : num 9279525 4232095 1738315 442308 4469979 ...
in fact, these are compatible and you can use a mix of tidyr and dplyr funcitons togeth
gather() we first name the new column for the new ID variable (obstyper_year), the name for the new amalgamated observation variable (obs_value), then the names of the old observation variables.select() function in dplyr, we can use starts_with() argument to select all variables that start with the desired stringgather also allows alternate syntax of using the - symbol to identify which variabels are not to be gathered (i.e. ID variables)
longformat
gap_long <- gap_wide %>% gather(obstype_year,obs_values,-continent,-country)
str(gap_long)
## 'data.frame': 5112 obs. of 4 variables:
## $ continent : chr "Africa" "Africa" "Africa" "Africa" ...
## $ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
## $ obstype_year: chr "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" ...
## $ obs_values : num 2449 3521 1063 851 543 ...
flexibility is a big time saver!
obstype_year contains 2 pieces of information, the observation type(pop, lifeExp, or gdpPercap) and the year.we can use the separate() function to split the character strings into multiple variables
gap_long <- gap_long %>% separate(obstype_year,into=c('obs_type','year'),sep="_")
gap_long$year <- as.integer(gap_long$year)
http://swcarpentry.github.io/r-novice-gapminder/14-tidyr#challenge-2
spread()gather to spread our observation variables back outgap_long() to the original intermediate format or the widest formatgap_normal <- gap_long %>% spread(obs_type,obs_values)
dim(gap_normal)
## [1] 1704 6
names(gap_normal)
## [1] "continent" "country" "year" "gdpPercap" "lifeExp" "pop"
names(gapminder)
## [1] "country" "year" "pop" "continent" "lifeExp" "gdpPercap"
Now we’ve got an intermediate dataframe gap_normal with the same dimensions as the original gapminder, but the order of the variables is different. Let’s fix that before checking if they are all.equal().
gap_normal <- gap_normal[,names(gapminder)]
all.equal(gap_normal,gapminder)
## [1] "Component \"country\": Modes: character, numeric"
## [2] "Component \"country\": Attributes: < target is NULL, current is list >"
## [3] "Component \"country\": target is character, current is factor"
## [4] "Component \"pop\": Mean relative difference: 1.634504"
## [5] "Component \"continent\": Modes: character, numeric"
## [6] "Component \"continent\": Attributes: < target is NULL, current is list >"
## [7] "Component \"continent\": target is character, current is factor"
## [8] "Component \"lifeExp\": Mean relative difference: 0.203822"
## [9] "Component \"gdpPercap\": Mean relative difference: 1.162302"
head(gap_normal)
## country year pop continent lifeExp gdpPercap
## 1 Algeria 1952 9279525 Africa 43.077 2449.008
## 2 Algeria 1957 10270856 Africa 45.685 3013.976
## 3 Algeria 1962 11000948 Africa 48.303 2550.817
## 4 Algeria 1967 12760499 Africa 51.407 3246.992
## 5 Algeria 1972 14760787 Africa 54.518 4182.664
## 6 Algeria 1977 17152804 Africa 58.014 4910.417
head(gapminder)
## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
We’re almost there, the original was sorted by country, continent, then year.
gap_normal <- gap_normal %>% arrange(country,continent,year)
all.equal(gap_normal,gapminder)
## [1] "Component \"country\": Modes: character, numeric"
## [2] "Component \"country\": Attributes: < target is NULL, current is list >"
## [3] "Component \"country\": target is character, current is factor"
## [4] "Component \"continent\": Modes: character, numeric"
## [5] "Component \"continent\": Attributes: < target is NULL, current is list >"
## [6] "Component \"continent\": target is character, current is factor"
gap_temp <- gap_long %>% unite(var_ID,continent,country,sep="_")
str(gap_temp)
## 'data.frame': 5112 obs. of 4 variables:
## $ var_ID : chr "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
## $ obs_type : chr "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
## $ year : int 1952 1952 1952 1952 1952 1952 1952 1952 1952 1952 ...
## $ obs_values: num 2449 3521 1063 851 543 ...
gap_temp <- gap_long %>%
unite(ID_var,continent,country,sep="_") %>%
unite(var_names,obs_type,year,sep="_")
str(gap_temp)
## 'data.frame': 5112 obs. of 3 variables:
## $ ID_var : chr "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
## $ var_names : chr "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" ...
## $ obs_values: num 2449 3521 1063 851 543 ...
gap_wide_new <- gap_long %>%
unite(ID_var,continent,country,sep="_") %>%
unite(var_names,obs_type,year,sep="_") %>%
spread(var_names,obs_values)
str(gap_wide_new)
## 'data.frame': 142 obs. of 37 variables:
## $ ID_var : chr "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
## $ gdpPercap_1952: num 2449 3521 1063 851 543 ...
## $ gdpPercap_1957: num 3014 3828 960 918 617 ...
## $ gdpPercap_1962: num 2551 4269 949 984 723 ...
## $ gdpPercap_1967: num 3247 5523 1036 1215 795 ...
## $ gdpPercap_1972: num 4183 5473 1086 2264 855 ...
## $ gdpPercap_1977: num 4910 3009 1029 3215 743 ...
## $ gdpPercap_1982: num 5745 2757 1278 4551 807 ...
## $ gdpPercap_1987: num 5681 2430 1226 6206 912 ...
## $ gdpPercap_1992: num 5023 2628 1191 7954 932 ...
## $ gdpPercap_1997: num 4797 2277 1233 8647 946 ...
## $ gdpPercap_2002: num 5288 2773 1373 11004 1038 ...
## $ gdpPercap_2007: num 6223 4797 1441 12570 1217 ...
## $ lifeExp_1952 : num 43.1 30 38.2 47.6 32 ...
## $ lifeExp_1957 : num 45.7 32 40.4 49.6 34.9 ...
## $ lifeExp_1962 : num 48.3 34 42.6 51.5 37.8 ...
## $ lifeExp_1967 : num 51.4 36 44.9 53.3 40.7 ...
## $ lifeExp_1972 : num 54.5 37.9 47 56 43.6 ...
## $ lifeExp_1977 : num 58 39.5 49.2 59.3 46.1 ...
## $ lifeExp_1982 : num 61.4 39.9 50.9 61.5 48.1 ...
## $ lifeExp_1987 : num 65.8 39.9 52.3 63.6 49.6 ...
## $ lifeExp_1992 : num 67.7 40.6 53.9 62.7 50.3 ...
## $ lifeExp_1997 : num 69.2 41 54.8 52.6 50.3 ...
## $ lifeExp_2002 : num 71 41 54.4 46.6 50.6 ...
## $ lifeExp_2007 : num 72.3 42.7 56.7 50.7 52.3 ...
## $ pop_1952 : num 9279525 4232095 1738315 442308 4469979 ...
## $ pop_1957 : num 10270856 4561361 1925173 474639 4713416 ...
## $ pop_1962 : num 11000948 4826015 2151895 512764 4919632 ...
## $ pop_1967 : num 12760499 5247469 2427334 553541 5127935 ...
## $ pop_1972 : num 14760787 5894858 2761407 619351 5433886 ...
## $ pop_1977 : num 17152804 6162675 3168267 781472 5889574 ...
## $ pop_1982 : num 20033753 7016384 3641603 970347 6634596 ...
## $ pop_1987 : num 23254956 7874230 4243788 1151184 7586551 ...
## $ pop_1992 : num 26298373 8735988 4981671 1342614 8878303 ...
## $ pop_1997 : num 29072015 9875024 6066080 1536536 10352843 ...
## $ pop_2002 : num 31287142 10866106 7026113 1630347 12251209 ...
## $ pop_2007 : num 33333216 12420476 8078314 1639131 14326203 ...
http://swcarpentry.github.io/r-novice-gapminder/14-tidyr#challenge-3
Now we have a great ‘wide’ format dataframe, but the ID_var could be more usable, let’s separate it into 2 variables with separate()
gap_wide_betterID <- separate(gap_wide_new,ID_var,c("continent","country"),sep="_")
gap_wide_betterID <- gap_long %>%
unite(ID_var, continent,country,sep="_") %>%
unite(var_names, obs_type,year,sep="_") %>%
spread(var_names, obs_values) %>%
separate(ID_var, c("continent","country"),sep="_")
str(gap_wide_betterID)
## 'data.frame': 142 obs. of 38 variables:
## $ continent : chr "Africa" "Africa" "Africa" "Africa" ...
## $ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
## $ gdpPercap_1952: num 2449 3521 1063 851 543 ...
## $ gdpPercap_1957: num 3014 3828 960 918 617 ...
## $ gdpPercap_1962: num 2551 4269 949 984 723 ...
## $ gdpPercap_1967: num 3247 5523 1036 1215 795 ...
## $ gdpPercap_1972: num 4183 5473 1086 2264 855 ...
## $ gdpPercap_1977: num 4910 3009 1029 3215 743 ...
## $ gdpPercap_1982: num 5745 2757 1278 4551 807 ...
## $ gdpPercap_1987: num 5681 2430 1226 6206 912 ...
## $ gdpPercap_1992: num 5023 2628 1191 7954 932 ...
## $ gdpPercap_1997: num 4797 2277 1233 8647 946 ...
## $ gdpPercap_2002: num 5288 2773 1373 11004 1038 ...
## $ gdpPercap_2007: num 6223 4797 1441 12570 1217 ...
## $ lifeExp_1952 : num 43.1 30 38.2 47.6 32 ...
## $ lifeExp_1957 : num 45.7 32 40.4 49.6 34.9 ...
## $ lifeExp_1962 : num 48.3 34 42.6 51.5 37.8 ...
## $ lifeExp_1967 : num 51.4 36 44.9 53.3 40.7 ...
## $ lifeExp_1972 : num 54.5 37.9 47 56 43.6 ...
## $ lifeExp_1977 : num 58 39.5 49.2 59.3 46.1 ...
## $ lifeExp_1982 : num 61.4 39.9 50.9 61.5 48.1 ...
## $ lifeExp_1987 : num 65.8 39.9 52.3 63.6 49.6 ...
## $ lifeExp_1992 : num 67.7 40.6 53.9 62.7 50.3 ...
## $ lifeExp_1997 : num 69.2 41 54.8 52.6 50.3 ...
## $ lifeExp_2002 : num 71 41 54.4 46.6 50.6 ...
## $ lifeExp_2007 : num 72.3 42.7 56.7 50.7 52.3 ...
## $ pop_1952 : num 9279525 4232095 1738315 442308 4469979 ...
## $ pop_1957 : num 10270856 4561361 1925173 474639 4713416 ...
## $ pop_1962 : num 11000948 4826015 2151895 512764 4919632 ...
## $ pop_1967 : num 12760499 5247469 2427334 553541 5127935 ...
## $ pop_1972 : num 14760787 5894858 2761407 619351 5433886 ...
## $ pop_1977 : num 17152804 6162675 3168267 781472 5889574 ...
## $ pop_1982 : num 20033753 7016384 3641603 970347 6634596 ...
## $ pop_1987 : num 23254956 7874230 4243788 1151184 7586551 ...
## $ pop_1992 : num 26298373 8735988 4981671 1342614 8878303 ...
## $ pop_1997 : num 29072015 9875024 6066080 1536536 10352843 ...
## $ pop_2002 : num 31287142 10866106 7026113 1630347 12251209 ...
## $ pop_2007 : num 33333216 12420476 8078314 1639131 14326203 ...
all.equal(gap_wide, gap_wide_betterID)
## [1] TRUE
There and back again!